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Ancient human remains of paleopathological interest typically contain highly degraded DNA in which 
pathogenic taxa are often minority components, making sequence-based metagenomic characterization 
costly. Microarrays may hold a potential solution to these challenges, offering a rapid, affordable, and highly 
informative snapshot of microbial diversity in complex samples without the lengthy analysis and/or high 
cost associated with high-throughput sequencing. Their versatility is well established for modern clinical 
specimens, but they have yet to be applied to ancient remains. Here we report bacterial profiles of 
archaeological and historical human remains using the Lawrence Livermore Microbial Detection Array 
(LLMDA). The array successfully identified previously- verified bacterial human pathogens, including 
Vibrio cholerae (cholera) in a 19th century intestinal specimen and Yersinia pestis ("Black Death" plague) in 
a medieval tooth, which represented only minute fractions (0.03% and 0.08% alignable high-throughput 
shotgun sequencing reads) of their respective DNA content. This demonstrates that the LLMDA can 
identify primary and/or co-infecting bacterial pathogens in ancient samples, thereby serving as a rapid and 
inexpensive paleopathological screening tool to study health across both space and time. 



Research into the origins of infectious diseases and population health through time faces many challenges, 
such as biased archival records and ambiguous paleopathological skeletal indicators of actual pathogen 
infection levels 1 . Despite its inherent fragility, ancient DNA (aDNA) remains a highly informative paleo- 
pathological study target, having been recovered and characterized from a variety of contexts, age depths and 
specimen types 2 . Recently, high-throughput sequencing (HTS), often coupled with targeted enrichment (TE), has 
allowed for the recovery of large genomic targets from archaeological specimens, including full pathogen gen- 
omes 35 . However, TE-HTS is only useful when the primary pathogen(s) are known or suspected to be present, 
and necessarily ignores non-targeted taxa and genomic loci. This is problematic because the primary pathogenic 
agent in an ancient paleopathological specimen can be elusive, and furthermore the entire microbiome likely 
played a significant role in past human health, as it does today 6 . Therefore establishing detailed levels of com- 
mensal and co-infecting pathogens is essential for accurately reconstructing past epidemics, population health, 
and disease susceptibility. As such, for paleopathologists wishing to examine changes in microbial co-infection 
levels across space and time, more comprehensive metagenomic characterization is necessary. One way to achieve 
this is by sequencing amplicons of conserved loci (such as 16S rRNA) that can to a degree measure the meta- 
genomic content of a sample. However, by design, amplicon datasets ignore potential taxonomically- informative 
diversity in more variable genomic regions, and for that matter can be biased by polymerase or disparate target 
abundances 7,8 . Metagenomic "shotgun" HTS on the other hand is arguably the most comprehensive and least 
biased method currently available for total microbial characterization for modern and aDNA specimens 9,10 , but 
very deep sequencing is often required to identify pathogens confidently. While certainly powerful, both of these 
metagenomic approaches can be labour- and time-intensive, thereby representing significant barriers for groups 
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Figure 1 | Number of bacterial families detected by HTS and/or LLMDA. Number of bacterial families (or less-specific higher taxonomic level) detected 
by HTS sequencing (green circles) and LLMDA analyses (blue circles). Families detected by both methods are indicated where the circles overlap. 
Values above the midline include all detected families, whereas values below the midline include only those families used for LLMDA probe design. Image 
of cholera victim specimen 3090.13 was taken by AMD; image of plague victim specimen 8291 was taken by HNP. 



that would like to thoroughly profile or screen the microbial content 
of large or difficult paleopathological sample sets. 

One potential technological solution to this issue is the microarray, 
which over the past two decades has been used for the large-scale 
study of gene expression and genie content of simple and complex 
samples 11 . Microarrays are glass slides densely spotted with clusters of 
single- stranded synthetic oligonucleotides that are allowed to hybrid- 
ize with fluorophore-labeled DNA from a sample, and the resulting 
fluorescence signals are interpreted to determine sequence composi- 
tion and/or taxonomic content. Recently, microarrays designed spe- 
cifically for characterizing the microbial content of complex samples 
have been successfully used (e.g. 1217 ), particularly in cases where tra- 
ditional clinical methods are inconclusive, time-consuming, and/or 
expensive 17 . Microarrays can contain up to millions of unique oligo- 
nucleotides and their use and analysis involve low processing time and 
cost 14 . Therefore, they potentially provide a more practical alternative 
to metagenomic HTS for characterizing the microbial content of 
paleopathological specimens. However, microarray detection tech- 
niques have not yet been applied to aDNA extracts, which due to 
short fragment length and base damage may present challenges. 

To assess the potential value of microarrays for pathogen detection 
in ancient samples, we compared microbial profiles of two archaeolo- 
gical human specimens generated with a recently- developed pathogen 
detection microarray to profiles generated with standard metage- 
nomic HTS analysis. For microarray analysis, we used the Lawrence 
Livermore Microbial Detection Array (LLMDA) designed by the 
Lawrence Livermore National Laboratory 12 , one of several array plat- 
forms developed in the last decade to identify pathogens in experi- 
mental mixtures and clinical samples 14 . The LLMDA v5 12-plex 135K 
array contains probes designed from all published vertebrate- infecting 
pathogen genomes. LLMDA probes target conserved regions amongst 
all known species/strains of a family (or equivalent unit), but due to 
the high number and overall diversity of probes, unique combinations 
of matching probes across an individual genome sequence allow for 
species or strain identification. Florescence data are analysed using a 
likelihood maximization algorithm to identify the combination of 
species that best explains the resulting signal. To achieve this, each 
signal set is compared against a current database of full microbial 
genomes and analysed for the expected vs. detected combined probe 
fluorescence signal, resulting in a species list ranked by likelihood of 
presence. If desired, these results can then be parsed to calculate 
overall likelihoods of higher taxa presence by summing the likelihoods 
of relevant species-level hits (see Supplementary Information for full 
description). The specimens we analysed here were a preserved 
medical intestine sample from an 1849 AD cholera victim (specimen 



3090.13) 3 and a tooth from a 1348AD Black Death plague victim 
(specimen 8291) 4 . Both were previously confirmed with TE-HTS to 
contain their relevant pathogens, though they constitute very low 
levels in shotgun HTS datasets (3090.13: 0.03% alignable with bow- 
tie 18 to Vibrio cholerae, the etiological agent of cholera; 8291: 0.08% 
alignable to Yersinia pestis, the etiological agent of the plague or Black 
Death). Both of these pathogens' families (Vibrionaceae and 
Enterobacteriaceae) have probes on the LLMDA, and therefore spe- 
cies in these families should be detectable. We specifically assessed (1) 
whether LLMDA would detect the previously determined pathogens 
in our ancient samples, (2) which additional bacterial taxa were 
detectable by both LLMDA and HTS, and (3) which bacterial taxa 
were detected by either LLMDA or HTS alone. 

Results 

We have restricted our taxonomic comparisons to bacteria, since the 
sequencing libraries were built from DNA only and thus not appro- 
priate for a complete viral survey. While both HTS and LLMDA 
analyses are capable of species-level identification (as LLMDA ana- 
lysis calculates the likelihood of presence for individual species/strain 
genomes), for the purposes of this paper we have focused on family- 
level identification for ease of comparison. Note that the v5 12xl35K 
LLMDA probes were derived from all complete vertebrate-infecting 
pathogen genome sequences available at the time of design 
(December 201 1); however, as the hybridization patterns were inter- 
preted using an updated database (April 2012), probes may match 
new genomes from other taxa (even non-vertebrate infecting species) 
and therefore potential taxonomic calls are not limited to those used 
for probe design. For the metagenomic HTS data, taxonomic assign- 
ments were identified by BLAST (blastn-megablast) 19 and 
MEGAN4 20 analysis against the National Center for Biotechnology 
Information (NCBI) RefSeq genome database 21 (October 2012). A 
schematic comparison is provided in Fig. 1 and results for both 
methods are given in Table 1 and Tables SI and S2. 

Taxa detected by both LLMDA & HTS. For cholera victim 3090.13, 
twenty-one bacterial families were detected by both LLMDA and the 
118 million BLASTed HTS reads from the sample (Fig. 1), 
representing 36.8% and 40.4% of the families called by each 
respective method. For plague victim 8291, fifty- three families 
were detected by both approaches, representing 89.8% and 27.9% 
of the families called by LLMDA and 83 million HTS reads, 
respectively. When we considered only the families with specific 
probes designed for them on the LLMDA array, we detected 19 
families in the cholera victim 3090.13 by both LLMDA and HTS, 
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Table 1 | Summary of LLMDA and HTS results. Only taxa with probes designed on the LLMDA array are shown (see TableSl for full results). 
Only taxa with at least 5 reads are called with HTS-MEGAN4 analysis. Reads = number of HTS reads assigned to that taxonomic level (- = 
not found in HTS dataset). LO score = LLMDA log odds score (- = not called with LLMDA). Phyla abbreviations = Act, Actinobacteria; Bac, 
Bacteroidetes; Chla, Chlamydiae; Chlo, Chlorobi; Chi, Chloroflexi; Fib, Fibrobacteres; Fir, Firmicutes; Fus, Fusobacteria; Pro, 
Proteobacteria; Spi, Spirochaetes; Syn, Synergistetes; Ten, Tenericutes; The, Thermotogae; Ver, Verrucomicrobia 
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representing 41.3% and 79.2% of the families called by each 
respective method and 46 families for the plague victim 8291, 
representing 92.0% and 56.8% of the families with probes on the 
array (LLMDA and HTS). 

The taxa detected by both methods included many groups with 
relatively high read counts in the HTS data (e.g. Aeromonadaceae 
and Enterobacteriaceae for 3090.13; Burkholderiaceae, Comamona- 
daceae, and Pseudomonadaceae for 8291). In addition, both methods 
detected the previously confirmed significant pathogens to the species 
level within the calls for their respective families (see Fig. SI and Table 
S3 for the LLMDA probe data supporting these pathogen calls). For 
3090.13, 10,379 (0.009% of BLAST reads) were V. cholerae, and 
LLMDA called the family Vibrionaceae with V. cholerae chromosomal 
sequences at a high log odds value (4,470.7). The detection of only V. 
cholerae chromosomal sequences amongst Vibrionaceae is not unex- 
pected, given that Vibrionaceae species are aquatic and only oppor- 
tunistically infect humans 22 . For 8291, 1,272 (0.001% of BLAST reads) 
were Y. pestis, and LLMDA called the family Enterobacteriaceae 
including Y. pestis pPCPl plasmid sequences (among multiple other 
species) at a high odds value (1,640.8). Y. pestis pPCPl was detected 
amongst a variety of other Enterobacteriaceae, which is a large family 
containing many normal microbiomic species such as E. coli 23 . The 
detection of pPCPl is not surprising, given that the plasmid is both 
highly specific to Y. pestis and is found at high copy number within the 
bacterium 24 , which may have facilitated detection. 

Taxa detected by only one method. BLAST analyses of HTS reads 
identified many bacterial families that were not detected by LLMDA 
analyses (for cholera victim 3090.13, n = 10, 32.3% of all HTS; for 
plague victim 8291, n = 137, 72.1% of allHTS), such as Neisseriaceae 
and Shewanellaceae in sample 3090.13, Cellulomonadaceae and 
Rhizobiaceae in sample 8291, and Fusobacteriaceae and Peptostrep- 
tococcaceae in both samples. Likewise, LLMDA analysis identified 
many families that HTS did not (for 3090.13, n = 36, 63.1% of all 
LLMDA; for 8291, n = 6, 10.2% of all LLMDA). However, when 
excluding taxonomic groups without specific probes represented on 
the array. However, when excluding taxonomic groups without 
specific probes represented on the array, only 5 families were 



detected by HTS alone (20.8% of all HTS) for sample 3090.13 and 
35 (43.2%) for 8291, while for LLMDA alone, 27 (58.7% of all 
LLMDA) were detected for 3090.13 and 4 (8.0%) for 8291. 

Discussion 

Figure 2 (cholera victim 3090.13) and Figure 3 (plague victim 8291) 
display the MEGAN4 output of the NCBI taxonomy for all family- 
level taxa identified with BLAST analysis of the HTS data and 
whether they were also detected with LLMDA. Overall, the 
LLMDA profiles reflect the major HTS-identified components well. 
Not only were the previously-identified pathogen bacterial species/ 
families detected via both methods, but a number of major envir- 
onmental, microbiomic and pathogenic taxa were identified to at 
least the order level (e.g., Actinomycetales, Bacilliales, Clostridiales, 
or Rhizobiales). While promising, a number of disparities between 
the profiles generated by each method encourage further investiga- 
tion into their origin, discussed below. 

When comparing metagenomic profiles generated by each method, 
it is important to be aware of the fundamental differences in their 
taxonomic identification strategies. For the analysis of BLAST output 
from HTS data, default parameters in MEGAN4 require five sequence 
reads to assign a taxon as being present; furthermore, the reads do not 
have to be assigned to the same species for family-level calls 20 . 
MEGAN4 also gives equal weight to read mappings that are concen- 
trated in narrow regions of a target genome, which are inherently less 
specific as indicators of the target's presence. A common possible 
scenario leading to false positive taxon assignments could occur in 
both HTS and microarray analysis, when reads or probes map to 
ribosomal RNA or housekeeping genes that are relatively conserved 
between related taxa. Microarray probes can be designed to avoid 
these conserved regions, but in general sequence reads mapping to 
such regions are not filtered out in metagenomic analysis. Therefore, 
BLAST/MEGAN4 analysis of HTS data emphasizes sensitivity at the 
expense of specificity. The CLiMax algorithm used for LLMDA ana- 
lysis requires that a family satisfy more stringent criteria to be con- 
sidered present. The initial CLiMax analysis is performed at the target 
genome level rather than the family level; for a target to be called 
present, a minimum of 4 probes or 20% of the probes matching a 
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Figure 2 | Comparison of HTS andLLMDA results for cholera victim 3090.13. Cladogram based on NCBI Genbank taxonomy indicating results of the 
BLASTN/MEGAN4 HTS analysis at the family level and above compared to LLMDA results. At the leaves, circle size reflects the relative number of 
reads assigned to those taxa (internal node sizes only indicated if > 10 reads). Colors of taxon names indicate whether that taxon had ( 1 ) reads present in 
the HTS data, (2) probes designed for that family on the LLMDA, and (3) LLMDA call for that taxon. Bacterial phyla and major clades are highlighted. 



specific genome (whichever is greater) must have intensities above an 
array-specific significance threshold. In addition, targets for which the 
high intensity probes are concentrated in narrow genomic regions are 
filtered out as potential false positives (see Supplementary Information 
for description of methods). When this filtering is removed, or if the 
minimum probe criteria are relaxed, CLiMax predicts the presence of 
several previously undetected families (data not shown). However, our 
previous experiments in which the LLMDA was hybridized to samples 
of known microbial content indicate that stringent filtering is neces- 
sary to avoid false positives 12 . Therefore, the CLiMax analysis is much 
more conservative in its predictions than BLAST/MEGAN4 analysis, 
emphasizing specificity over sensitivity, and possibly explaining some 
of the apparent undetected taxa in the LLMDA data. 

Several taxa detected with HTS were not detected with LLMDA. 
Many of these are unsurprising, as no probes designed from their 
genomes were present on the array. However, for those taxa with 



probes on the array, one possibility is that the LLMDA is simply not 
as sensitive as HTS at these sequencing depths: in plague victim 8291, 
taxa not detected with LLMDA had significantly fewer HTS reads 
than those that were (one-tailed, unequal variance Student's t-test, p 
= 0.002; Fig. 4a), though these variables are not significantly related 
for the cholera victim sample 3090.13 (p = 0.076). Furthermore, 
several taxa with relatively high read counts and with probes 
designed on the array were surprisingly not called (e.g., 
Sphingomonadaceae in sample 8291; Peptostreptococcaceae in both 
samples). That said, in the majority of cases where a family with 
probes designed on the array was declared present by BLAST/ 
MEGAN4, but not called with LLMDA, a closely- related taxon was 
called (e.g., in both samples, Clostridiaceae was called although its 
close relative Peptostreptococcaceae was not). 

To better understand the data used by MEGAN4 to call the family 
Peptostreptococcaceae as present, we examined the ribosomal RNA 
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Figure 3 | Comparison of HTS and LLMDA results for plague victim 8291. See caption for Figure 2. 



gene, and other feature annotations for the mapped read positions in 
Clostridium difficile strain 630 (RefSeq accession NC_009089.1), one 
of the fully sequenced genomes in this family. Notably, 915 of 1328 
(69%) reads mapping to this genome from cholera victim 3090.13 
and 146 of 319 (46%) from plague victim 8291 were within rRNA 
genes. Since rRNA genes only cover 1.1% of the C. difficile 630 
genome, these read counts are much higher than would be expected 
by chance alone. Consequently, we suspect that a large part of the 
data used by MEGAN4 to call this family as present is based on reads 
that map to highly conserved genes, and could also support the 
presence of a related taxon. Although a detailed analysis of 
MEGAN4 performance is beyond the scope of this study, our pre- 
liminary results suggest that its relative non-specificity could under- 
lie some of the discrepancies between HTS and microarray 
identifications. 

We also considered the possibility that relatively low GC content 
of the targets could compromise hybridization-based LLMDA detec- 
tion. Average log (fluorescence) intensity of probes for a given taxon 
is strongly correlated with the average GC% of that probe set (r = 



0.56, p = 0.0028, R 2 = 0.368 for cholera victim 3090.13; r = 0.65, p = 
2.5 X 10~ 13 , R 2 = 0.653 for plague victim 8291; Fig. S3), but LLMDA 
detected taxa across the range of average log intensities. 
Furthermore, for taxa used for probe design, there was no significant 
difference in GC content between LLMDA-positive and LLMDA- 
negative HTS reads (two-tailed, unequal variance Student's t-test, p 
= 0.252 for 3090.13, p = 0.779 for 8291; Fig. 4b). This indicates that 
GC content alone cannot explain a taxon's presence or absence from 
the LLMDA calls. We also considered the possibility that confident 
LLMDA identification may be compromised if regional preservation 
or amplification biases reduce the evenness of genomic representa- 
tion amongst the reads. However, for taxa with probes on the array, 
there was no significant difference between the proportions of 
unique genomic bases covered by HTS reads for LLMDA-positive 
and LLMDA-negative taxa (two-tailed, unequal variance Student's t- 
test, p = 0.365 for 3090.13, p = 0.843 for 8291; Fig. 4c). 

Several taxa were detected only with LLMDA. This may suggest 
that the LLMDA is more sensitive than HTS to certain taxa, as a 
rarefaction analysis of the HTS data suggests that in neither sample 
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have all the HTS-detectable families likely been observed at these 
sequencing depths (Fig. 5). Cholera victim 3090.13 in particular 
shows a near-linear rarefaction curve, potentially explaining why it 
has so many more LLMDA-only calls than does plague victim 8291. 
However, taxa detected by both HTS and LLMDA still have signifi- 
cantly higher LLMDA log odds scores than taxa detected by LLMDA 
alone (one-tailed, unequal variance Student's t-test, p = 0.015 for 
3090.13, p = 0.007 for 8291; Fig. 4d). This difference likely reflects the 
fact that LLMDA calls with smaller log odds scores are supported by 
fewer detected probes, and are thus inherentiy less reliable. However, 
the relationship between log odds scores and HTS observations is 
imperfect, as several taxa with relatively high read counts have max- 
imum log odds score values within the range of LLMDA-only calls 
(e.g., Caulobacteraceae for sample 8291 and Moraxellaceae for sam- 
ple 3090.13). Again as noted above, there is no significant difference 
between the proportion of unique genomic bases covered by HTS 
reads for LLMDA-positive and LLMDA-negative taxa (Fig. 4c). 

We have demonstrated that the LLMDA provides similar bacterial 
family-level metagenomic profiles of archaeological and archival 
specimens as HTS, especially for the most abundant taxa, and suc- 
cessfully detected the previously- verified infecting pathogen species 
in both specimens. Furthermore, as demonstrated with cholera vic- 
tim 3090.13, it is potentially capable of detecting bacterial taxa that 
are insufficiently or unable to be detected even with very large HTS 



datasets, due to the very deep sequencing depths required to observe 
low abundance HTS taxa, likely common for many co-infecting 
pathogens in complex aDNA extracts. This is encouraging, since 
LLMDA analysis is at least one order of magnitude less expensive 
and labor-intensive than metagenomic HTS. As such, the technique 
could be productively applied in a number of research settings, 
depending on the specific question and the nature of the specimens. 
For instance, dozens of samples could be rapidly assessed for the 
most abundant pathogen constituents. Use of the LLMDA may also 
integrate well into TE-HTS studies not only by narrowing the range 
of targets for hybridization capture, but also by generating enriched 
libraries via elution from the microarray itself, which can be later 
sequenced. However the profiles generated by the LLMDA and HTS 
are not identical, and criteria for confident taxon identification with 
both platforms remains imperfect. We have shown that no simple 
variable completely explains the signal disparities, and it is likely that 
a combination of analysis techniques, sequence factors such as GC 
content, and probe design drive the disagreements between the 
LLMDA and HTS. Further methodological evaluation may be able 
to refine these disparities. We expect that microarrays will progress 
in the near future to become an excellent screening tool for archae- 
ological samples where microbial profiles can be swiftly, cheaply, and 
accurately reconstructed thereby aiding the elucidation of popu- 
lation health through deep time. 
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Figure 5 | HTS rarefaction analysis. Rarefaction curves showing the 
number of bacterial families represented by at least 5 reads as a percent of 
the total observed families per sample with increasing read depth (0.1% 
increments). Dashed lines represent lines of best-fit; cholera victim 
specimen 3090.13 is a linear curve (R 2 — 0.96936), plague victim specimen 
8291 is a logarithmic curve (R 2 - 0.98217). 

Methods 

Libraries from these specimens were both shotgun HTS sequenced (divided across 
one HiSeq 1000 lane: 141,039,627 reads for cholera victim 3090.13, 122,830,910 reads 
for plague victim 8291) and utilized for LLMDA analysis. HTS datasets were com- 
pared to the NCBI RefSeq database 21 using BLAST 2.2.26+ 19 and the resulting BLAST 
reports were parsed using MEGAN4 v.4.70.4 with the default settings 20 . Taxonomic 
trees were illustrated manually using FigTree (v. 1.4.0; http://tree.bio.ed.ac.uk/ 
software/figtree) based on MEGAN4 results. Indexed libraries were sent to Lawrence 
Livermore National Laboratory (LLNL) for blind analysis using the 12-plex 135K 
Roche NimbleGen version of the LLMDA v5 array, which is designed to target 3521 
vertebrate- infecting species from 215 families (including bacteria, archaea, viruses, 
protozoa and fungi). A brief summary of the LLMDA workflow is as follows: libraries 
are linearly amplified via random hexamers (Cy-3 labeled) to add the necessary 
fluorescent signal, hybridized to the LLMDA array for 65 h, washed, scanned, and 
analysed. Unlike other aDNA experiments utilizing in-solution or array hybridiza- 
tion, "blocking oligonucleotides" were not used, as this is not a standard component 
of the LLMDA procedure. Arrays were analysed using the CLiMax algorithm 12 with 
probe intensity threshold set to the 95 th percentile of negative controls. See 
Supplementary Information for all further details. 
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